
Meaningful Pose-Based Sign Language Evaluation

Jiang, Zifan, Leong, Colin, Moryossef, Amit, Göhring, Anne, Rios, Annette, Cory, Oliver, Ivashechkin, Maksym, Tarigopula, Neha, Zhang, Biao, Sennrich, Rico, Ebling, Sarah

arXiv.org Artificial Intelligence

We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.
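The keypoint distance-based family of metrics the study covers can be illustrated with a minimal mean per-joint error; this is a generic sketch, not the toolkit's actual implementation, and the array shapes are assumptions:

```python
import numpy as np

def mean_joint_error(pred, ref):
    """Mean Euclidean distance per keypoint between two aligned pose
    sequences of shape (frames, keypoints, dims)."""
    assert pred.shape == ref.shape
    return float(np.linalg.norm(pred - ref, axis=-1).mean())

# Toy example: two 3-frame, 2-keypoint, 2-D pose sequences.
ref = np.zeros((3, 2, 2))
pred = np.ones((3, 2, 2))  # every keypoint is off by (1, 1)
print(mean_joint_error(pred, ref))  # → 1.4142135623730951 (i.e. sqrt(2))
```

Embedding-based and back-translation-based metrics replace the raw keypoint comparison with distances in a learned feature space or with text-level scores of a back-translated utterance, respectively.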


Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation

Wu, Zetian, Zhou, Tianshuo, Lee, Stefan, Huang, Liang

arXiv.org Artificial Intelligence

Sign language translation from text to video plays a crucial role in enabling effective communication for Deaf and hard-of-hearing individuals. A major challenge lies in generating accurate and natural body poses and movements that faithfully convey intended meanings. Prior methods often neglect the anatomical constraints and coordination patterns of human skeletal motion, resulting in rigid or biomechanically implausible outputs. To address this, we propose a novel approach that explicitly models the relationships among skeletal joints (including shoulders, arms, and hands) by incorporating geometric constraints on joint positions, bone lengths, and movement dynamics. During training, we introduce a parent-relative reweighting mechanism to enhance finger flexibility and reduce motion stiffness. Additionally, bone-pose losses and bone-length constraints enforce anatomically consistent structures. Our method narrows the performance gap between the previous best and the ground-truth oracle by 56.51%, and further reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, demonstrating significant gains in anatomical realism and motion naturalness.
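A bone-length constraint of the kind described can be sketched as below; the skeleton topology (`BONES`) and the exact loss form are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

# Hypothetical 4-keypoint chain; each bone is a (parent, child) index pair.
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_lengths(pose, bones=BONES):
    """Euclidean length of each bone for a pose of shape (keypoints, dims)."""
    return np.array([np.linalg.norm(pose[c] - pose[p]) for p, c in bones])

def bone_length_loss(pred, ref, bones=BONES):
    """Mean absolute deviation of predicted bone lengths from the reference
    skeleton, encouraging anatomically consistent structure."""
    return float(np.abs(bone_lengths(pred, bones) - bone_lengths(ref, bones)).mean())

ref_pose = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])   # unit bones
pred_pose = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])  # stretched 2x
print(bone_length_loss(pred_pose, ref_pose))  # → 1.0 (every bone is 1.0 too long)
```

Because the loss compares lengths rather than positions, it stays small for any pose that preserves the reference skeleton's proportions, which is exactly the anatomical-consistency property the abstract targets.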


Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production

Feng, Liqian, Wang, Lintao, Hu, Kun, Kong, Dehui, Wang, Zhiyong

arXiv.org Artificial Intelligence

Sign language production (SLP) aims to translate spoken language sentences into a sequence of pose frames in a sign language, bridging the communication gap and promoting digital inclusion for deaf and hard-of-hearing communities. Existing methods typically rely on gloss, a symbolic representation of sign language words or phrases that serves as an intermediate step in SLP. This limits the flexibility and generalization of SLP, as gloss annotations are often unavailable and language-specific. Therefore, we present a novel diffusion-based generative approach, Text2Sign Diffusion (Text2SignDiff), for gloss-free SLP. Specifically, a gloss-free latent diffusion model is proposed to generate sign language sequences from noisy latent sign codes and spoken text jointly, reducing the potential error accumulation through a non-autoregressive iterative denoising process. We also design a cross-modal signing aligner that learns a shared latent space to bridge visual and textual content in sign and spoken languages. This alignment supports the conditioned diffusion-based process, enabling more accurate and contextually relevant sign language generation without gloss. Extensive experiments on the commonly used PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method, achieving state-of-the-art performance.
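The non-autoregressive iterative denoising described above can be illustrated with a toy loop; `denoise_step` here is a placeholder for the learned conditional denoiser, and all shapes, schedules, and update rules are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(z, t, text_emb):
    """Placeholder for the learned conditional denoiser: a real model would
    predict and remove noise given latent z, timestep t, and text conditioning."""
    return 0.9 * z + 0.1 * text_emb  # toy update pulling z toward the condition

def generate(text_emb, steps=50):
    """Non-autoregressive generation: start from pure noise and iteratively
    refine the whole latent sign-code sequence at once."""
    z = rng.normal(size=text_emb.shape)
    for t in reversed(range(steps)):
        z = denoise_step(z, t, text_emb)
    return z

latents = generate(np.ones(16))  # converges toward the conditioning vector
```

Because every refinement step updates the entire latent sequence jointly, no frame's prediction is conditioned on a previously generated frame, which is how the non-autoregressive process avoids the error accumulation of autoregressive decoding.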


A Transformer-Based Framework for Greek Sign Language Production using Extended Skeletal Motion Representations

Pratikaki, Chrysa, Filntisis, Panagiotis, Katsamanis, Athanasios, Roussos, Anastasios, Maragos, Petros

arXiv.org Artificial Intelligence

To address communication barriers between the DHH (Deaf and Hard-of-Hearing) and the hearing communities, the field of Sign Language Processing has emerged at the intersection of linguistics, computer vision, and machine learning. Sign Language Processing encompasses a variety of tasks aimed at bridging the gap between DHH and hearing communities by enabling the automatic translation and generation of sign language. The most critical components of an effective sign language system are Sign Language Translation (SLT) and Sign Language Production (SLP). In this paper, we primarily focus on Sign Language Production (SLP). Building on insights from previous research, we propose a deep learning model for Sign Language Production (SLP), which to our knowledge is the first attempt at Greek SLP. We tackle this task by utilizing a transformer-based architecture that enables the translation from text input to human pose keypoints, and the opposite. We evaluate the effectiveness of the proposed pipeline on the Greek SL dataset Elementary23, through a series of comparative analyses and ablation studies. Our pipeline's components, which include data-driven gloss generation, training through video to text translation and a


Beyond Words: AuralLLM and SignMST-C for Precise Sign Language Production and Bidirectional Accessibility

Li, Yulong, Zhang, Yuxuan, Tang, Feilong, Zhou, Mian, Lu, Zhixiang, Xue, Haochen, Wang, Yifang, Dang, Kang, Su, Jionglong

arXiv.org Artificial Intelligence

Although sign language recognition helps hearing individuals understand signers, many hearing-impaired individuals still rely on sign language alone due to limited literacy, underscoring the need for advanced sign language production and translation (SLP and SLT) systems. In the field of sign language production, the lack of adequate models and datasets restricts practical applications. Existing models face challenges in production accuracy and pose control, making it difficult to provide fluent sign language expressions across diverse scenarios. Additionally, data resources are scarce, particularly high-quality datasets with complete sign vocabulary and pose annotations. To address these issues, we introduce CNText2Sign and CNSign, comprehensive datasets to benchmark SLP and SLT, respectively, with CNText2Sign covering gloss and landmark mappings for SLP, and CNSign providing extensive video-to-text data for SLT. To improve the accuracy and applicability of sign language systems, we propose the AuraLLM and SignMST-C models. AuraLLM, incorporating LoRA and RAG techniques, achieves a BLEU-4 score of 50.41 on the CNText2Sign dataset, enabling precise control over gesture semantics and motion. SignMST-C employs self-supervised rapid motion video pretraining, achieving a BLEU-4 score of 31.03/32.08 on the PHOENIX2014-T benchmark, setting a new state-of-the-art. These models establish robust baselines for the datasets released for their respective tasks.


Learning to Write Rationally: How Information Is Distributed in Non-Native Speakers' Essays

Tang, Zixin, van Hell, Janet G.

arXiv.org Artificial Intelligence

People tend to distribute information evenly in language production for better and clearer communication. In this study, we compared essays written by second language learners with various native language (L1) backgrounds to investigate how they distribute information in their non-native language (L2) production. Analyses of surprisal and constancy of entropy rate indicated that writers with higher L2 proficiency can reduce the expected uncertainty of language production while still conveying informative content. However, the uniformity of information distribution showed less variability among different groups of L2 speakers, suggesting that this feature may be universal in L2 essay writing and less affected by L2 writers' variability in L1 background and L2 proficiency.
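The surprisal and information-uniformity analyses mentioned above can be sketched with a simple unigram model standing in for the language model such studies typically use; the function names and the variance-based uniformity measure are illustrative assumptions:

```python
import math
from collections import Counter

def surprisals(tokens, counts, total):
    """Per-token surprisal -log2 p(w), here under a simple unigram model."""
    return [-math.log2(counts[t] / total) for t in tokens]

def surprisal_variance(surp):
    """Variance of surprisal across the text: lower variance means
    information is distributed more evenly (more uniform density)."""
    m = sum(surp) / len(surp)
    return sum((s - m) ** 2 for s in surp) / len(surp)

text = "the cat sat on the mat".split()
counts = Counter(text)
surp = surprisals(text, counts, len(text))
print(round(surprisal_variance(surp), 4))  # → 0.2222
```

Lower mean surprisal corresponds to the reduced "expected uncertainty" the abstract attributes to proficient L2 writers, while low surprisal variance corresponds to the uniformity of information distribution it finds to be stable across L1 backgrounds.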


An Open-Source American Sign Language Fingerspell Recognition and Semantic Pose Retrieval Interface

Thomas, Kevin Jose

arXiv.org Artificial Intelligence

This paper introduces an open-source interface for American Sign Language fingerspell recognition and semantic pose retrieval, aimed to serve as a stepping stone towards more advanced sign language translation systems. Utilizing a combination of convolutional neural networks and pose estimation models, the interface provides two modular components: a recognition module for translating ASL fingerspelling into spoken English and a production module for converting spoken English into ASL pose sequences. The system is designed to be highly accessible, user-friendly, and capable of functioning in real-time under varying environmental conditions like backgrounds, lighting, skin tones, and hand sizes. We discuss the technical details of the model architecture, application in the wild, as well as potential future enhancements for real-world consumer applications.


Universal Gloss-level Representation for Gloss-free Sign Language Translation and Production

Hwang, Eui Jun, Cho, Sukmin, Lee, Huije, Yoon, Youngwoo, Park, Jong C.

arXiv.org Artificial Intelligence

Sign language, essential for the deaf and hard-of-hearing, presents unique challenges in translation and production due to its multimodal nature and the inherent ambiguity in mapping sign language motion to spoken language words. Previous methods often rely on gloss annotations, requiring time-intensive labor and specialized expertise in sign language. Gloss-free methods have emerged to address these limitations, but they often depend on external sign language data or dictionaries, failing to completely eliminate the need for gloss annotations. There is a clear demand for a comprehensive approach that can supplant gloss annotations and be utilized for both Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a unified and self-supervised solution for both SLT and SLP, trained on multiple datasets including PHOENIX14T, How2Sign, and NIASL2021. Our results demonstrate UniGloR's effectiveness in the translation and production tasks. We further report an encouraging result for Sign Language Recognition (SLR) on previously unseen data. Our study suggests that self-supervised learning can be applied in a unified manner, paving the way for innovative and practical applications in future research.


T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text

Yin, Aoxiong, Li, Haoyuan, Shen, Kai, Tang, Siliang, Zhuang, Yueting

arXiv.org Artificial Intelligence

In this work, we propose a two-stage sign language production (SLP) paradigm that first encodes sign language sequences into discrete codes and then autoregressively generates sign language from text based on the learned codebook. However, existing vector quantization (VQ) methods are fixed-length encodings, overlooking the uneven information density in sign language, which leads to under-encoding of important regions and over-encoding of unimportant regions. To address this issue, we propose a novel dynamic vector quantization (DVA-VAE) model that can dynamically adjust the encoding length based on the information density in sign language to achieve accurate and compact encoding. Then, a GPT-like model learns to generate code sequences and their corresponding durations from spoken language text. Extensive experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method. To promote sign language research, we propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts. Experimental analysis on PHOENIX-News shows that the performance of our model can be further improved by increasing the size of the training data. Our project homepage is https://t2sgpt-demo.yinaoxiong.cn.
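The fixed-length VQ baseline that the dynamic DVA-VAE improves on amounts to a nearest-codebook lookup per frame; the sketch below is illustrative, and the feature dimensions are assumptions:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Fixed-length VQ: map every frame to its nearest codebook entry.
    frames: (T, d) pose features; codebook: (K, d) learned code vectors."""
    # dists[t, k] = squared distance between frame t and code k
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # one code index per frame, regardless of content

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, 0.0], [0.9, 1.1], [0.0, 0.2]])
print(vq_encode(frames, codebook))  # → [0 1 0]
```

Because this assigns exactly one code per frame, information-dense stretches of signing get the same code budget as static ones; the dynamic variant instead lets the encoding length vary with information density.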


What is a word?

Murphy, Elliot

arXiv.org Artificial Intelligence

"Despite 2,400 years or so of trying, it is unclear that anyone has ever come up with an adequate definition of any word whatsoever, even the simplest." Surprisingly few linguists and philosophers have a clear model of what a word is, even though words impact basically every aspect of human life. Researchers who regularly publish academic papers about language often rely on outdated, or inaccurate, assumptions about wordhood. As in all scientific disciplines, we have two notions to consider: 1. our intuitive concept of 'word' (which we all have, even though it can be vague, and sometimes hard to articulate fully, like most complex concepts). This is no different from other scientific concepts: for example, 'water' has a very intuitive meaning, but it is also linked to much more technical, formal notions emerging from chemistry and physics (Murphy 2023). This short pedagogical document outlines what the lexicon is most certainly not (though it is often mistakenly taken to be), what it might be (based on current good theories), and what some implications for experimental design are. The central features of lexical items have no connection with sensorimotor instructions.